Investigating CO2 Emissions and Renewable Energy Usage

Author

Noel Lopez, Patrick George, Riley Svensson, Ningjing Hua

Code
library(plotly)
library(tidyverse)
library(broom)
library(knitr)
library(DT)
library(kableExtra)
library(gridExtra)
library(patchwork)
options(scipen = 99999)
energy <- readxl::read_xlsx(here::here("renewable_energy.xlsx"))
co2 <- readxl::read_xlsx(here::here("co2.xlsx"))
Code
convert_to_numeric <- function(x) {
  x <- str_replace_all(x, "k", "e3")
  x <- str_replace_all(x, "M", "e6")
  return(as.numeric(x))
}


co2_clean <- co2 |>
  select(country, `1989`:`2017`) |>
  mutate(across(.cols = `1989`:`2017`, 
                ~ convert_to_numeric(.))) |>
  pivot_longer(cols = !country,
               names_to = "year",
               values_to = "Total_CO2_Emissions")

energy_clean <- energy |>
  pivot_longer(cols = !country,
               names_to = "year",
               values_to = "Percent_Consumption_Renewable")

joined <- inner_join(co2_clean, energy_clean) |>
  drop_na()

joined |>
  distinct(year) |>
  count()

joined |>
  distinct(country) |>
  count()

master <- joined|>
  rename(Year = year) |>
  group_by(Year) |>
  mutate(Year = as.numeric(Year)) |>
  summarize(`Average Emissions` = mean(`Total_CO2_Emissions`),
            `Average Percent Renewable`
            = mean(`Percent_Consumption_Renewable`))

1 Introduction

Climate change is a global challenge jeopardizing humanity’s future. This purpose of this project is to investigate the relationship between CO2 emissions, one of the largest drivers of climate change creating greenhouse gasses, and the use of renewable energy such as solar, wind, or hydroelectric energy. The data to be used in this project incorporates data from two different sources, one measuring total carbon dioxide emissions per country over the years and one measuring proportion of energy use which is renewable over the years. Total_CO2_Emission is in 1000 tons of CO2 and Percent_Consumption_Renewables is a percentage of total energy use produced by renewable sources per country over the years. Because the data is collected on individual countries over the years, this investigation will primarily focus on the Average Emissions as well as the Average Percent Renewable of all the countries in a specific year. In the data of which the averages will be investigated, there are 207 distinct countries observed over 29 years. The relationship between Average Emissionsand Average Percent Renewable is hypothesized to be negative whereas humanity used more renewable energy on average there would be less average CO2 emissions. Through this investigation hopefully more information can be learned about climate change and potential solutions to this global problem affecting every organism on planet Earth.

2 Materials and Methods

The master data to be used for analysis in this project incorporates the averages for all countries over the years from two different data sets, one measuring the total carbon dioxide emissions, per country over the years and the other measuring the proportion of energy use classified as renewable per country over the years. The data sets were sourced from an online database titled Gapminder which has been a reliable provider of data since 2005, in hopes of promoting global sustainability through easily accessible information.

Before analysis, the data had to be properly cleaned and wrangled. The CO2 data set contained values in the form of 25.3k and 4.9M, signaling the units of thousands and millions. The first step in the data cleaning process was to replace these characters with numbers to create a numeric variable which calculations could be performed on. Once the data was of all the same type, the data was then pivoted to be in tidy form and the years of interest (1989 – 2017) were selected. Once pivoted, the data sets were joined by year and country to produce a clean master data set which shall be used for model fitting, plotting, and analysis. A final decision was made to drop all the Na’s in the master set due to a few reasons. The missing values typically occurred in the Average Percent Renewable variable across the earlier years measured and usually in countries with smaller populations. Whether these Na’s were included in the data due to lack of observation or for simply not having any renewable energy production or consumption it is unclear. Because the averages are being studied and the Na’s usually occured in smaller countries, there was still a large enough sample size to average over once those observations were dropped. Other forms of imputation were contemplated such as cell-mean imputation but not conducted due to fears of introducing bias to the data.

Linear regression, a method involving predicting the values of one variable, based on another, through producing a straight line minimizing the value for the sum of squared residuals, was used to create and predict a model. All the data cleaning, models, and subsequent analysis was conducted using R code.


Code
datatable((head(master, n = 50)),
          caption = 'Interactive Preview of Data Set')
<<<<<<< HEAD
=======
>>>>>>> 0da3565d7a4ae51752152d44853cb9d35df3512c


3 Analysis and Discussion of Model

Code
energy_emissions_model <- lm(`Average Emissions` ~ 
                             `Average Percent Renewable`,
                             data = master)
summary(energy_emissions_model)
tidy(energy_emissions_model)

Regression Equation:

\[ \hat{y} = 693309.6 - 17770.2 * Average\,Percent\,Renewable \]

Above is the simple linear regression equation based on the model predicting the response variable, Average Emissions (\(\hat{y}\)), by the explanatory variable, Average Percent Renewable. The coefficient on Average Percent Renewable is extremely negative sitting at -17770.2 meaning that for each one percent increase in the Average Percent Renewable energy used the average amount of CO2 emissions in thousands of tons decreases by 17770.2. The slope coefficient on the y-intercept of 693309.6 means that when the Average Percent Renewable is zero such that there is no renewable energy being used at all on average, the predicted Average Emissions would be 693309.6 thousand tons.


Code
# Plot 1
raw_graph <- master |> 
  ggplot(aes(x = `Average Percent Renewable`, 
            y = `Average Emissions`)
       ) +
  geom_point(color = 'aquamarine4', size = 0.9) + 
  geom_smooth(method = "lm", 
              color = 'lightgoldenrod4', 
              size = 0.5,
              se = FALSE) +
  theme(legend.position = "none") +
  labs(x = " Average Renewable Energy Consumption (%)", 
       y = "", 
       title = "Relationship between Renewable Energy Usage and CO2 Emissions", 
       subtitle = 'Average Emissions (1000 tons)') 


ggplotly(raw_graph) |>
  layout(title = list(text = paste0('Relationship between Renewable Energy Usage and CO2 Emissions',
                                    '<br>',
                                    '<sup>',
                                     'Average Emissions (1000 tons)',
                                    '</sup>')))
<<<<<<< HEAD
=======
>>>>>>> 0da3565d7a4ae51752152d44853cb9d35df3512c


The graph above demonstrates the relationship between Average Emissions and Average Percent Renewable. The distribution illustrates a negative linear relationship, where the points are relatively close to the plotted regression line with little deviation and noise. There are little to no unusual observations. This illustration is consistent with the hypothesis that as the Average Percent Renewable increases the Average Emissions decreases at a significant rate.

Code
co2_by_year_graph <- master |>
  ggplot(aes(x = Year, y = `Average Emissions`)) +
  geom_point(color = 'firebrick') +
  geom_line(color = 'firebrick') +
  labs( y = '', 
        title = 'Average Emissions Over Time') 
energy_by_year_graph <- master |>
  ggplot(aes(x = Year, y = `Average Percent Renewable`)) +
  geom_point(color = 'green4') +
  geom_line(color = 'green4') +
  labs( y = '', 
        title = 'Average Emissions Over Time') 
ggplotly(co2_by_year_graph) |>
  layout(title = list(text = paste0('Average Emissions Over Time',
                                    '<br>',
                                    '<sup>',
                                     'Average Emissions (1000 tons)',
                                    '</sup>')))
<<<<<<< HEAD
=======
>>>>>>> 0da3565d7a4ae51752152d44853cb9d35df3512c
Code
ggplotly(energy_by_year_graph) |>
  layout(title = list(text = paste0('Average Percent Renewable Over Time',
                                    '<br>',
                                    '<sup>',
                                     'Average Renewable Energy Consumption (%)',
                                    '</sup>')))
<<<<<<< HEAD
=======
>>>>>>> 0da3565d7a4ae51752152d44853cb9d35df3512c


As shown by the two distributions of Average Emissions and Average Percent Renewable over time, the relationship follows a negative relationship, but perhaps not as expected. Average Percent Renewable is decreasing over the years while Average Emissions is increasing which still illustrates a negative relationship. As time goes on it makes sense as to why Average Emissions is increasing, because of extreme population growth and growing demand for production but Average Percent Renewable has shockingly been declining in recent years. While this likely has something to due to the varying definition of renewable energy, for example whether or not nuclear energy is truly renewable, it is surprising that as technology develops renewable energy use does not. This signifies that humanity needs to increase the renewable energy production and usage on average in order to reduce the carbon footprint and preserve nature for future generations.

Model Fit:

Code
energy_emissions_model |>
  augment() |>
  summarize(`Variance of Fitted` = var(.fitted),
            `Variance of Residuals` = var(.resid),
            `Variance of Average CO2 Emissions` = var(`Average Emissions`)) |>
  kable(caption = 'Model Fit',
        digits = 10,
        format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c('striped', 'bordered'))
Model Fit
Variance of Fitted Variance of Residuals Variance of Average CO2 Emissions
415,649,404 46,532,518 462,181,923

The proportion of variability in the response values that was accounted for by the model, \(R^{2}\), was particularly large, at about 89.93 percent. This suggests a high quality model, where a majority, 89%, of the variation in the response, Average Emissions is explained by the explanatory variable, Average Percent Renewable. This suggests that a high proportion of variability in response is accounted for by the linear model and there are not many other large factors influencing emissions.

Code
model_stats <- energy_emissions_model |>
  augment() |>
  rename(Residual = .resid,
         Fitted = .fitted) 


residual_plot <- model_stats |>
  ggplot(aes(x = Fitted , y = Residual, color = Residual  >0)) +
  geom_point(show.legend = FALSE) +
  scale_colour_manual(values = 
                        setNames(c('darkslategray4','darkolivegreen4'),
                                 c(T, F))) +
  geom_hline(yintercept=0, 
             linetype='dotted', 
             color = 'deeppink3', show.legend = FALSE) +
  labs(y = '',
       x = 'Fitted Values',
       title = 'Relationship between Residual and Fitted Values') +
  theme(legend.position="none")


ggplotly(residual_plot) |>
  layout(title = list(text = paste0('Relationship between Residual and Fitted Values',
                                    '<br>',
                                    '<sup>',
                                     'Residuals',
                                    '</sup>')))
<<<<<<< HEAD
=======
>>>>>>> 0da3565d7a4ae51752152d44853cb9d35df3512c


We can see from the relationship between residual and fitted values graph that, the linearity is violated in the range of \(1.0\times10^5\) to \(1.4\times10^5\) (equal variance). This is because in the range of \(1.0\times10^5\) to \(1.2\times10^5\), the observed data is over the estimated linear regression line, which render only positive residues. And in the range of \(1.2\times10^5\) to \(1.4\times10^5\), the observed data is below the linear regression line, which render only negative residuals. In the range of \(1.4\times10^5\) to \(1.6\times10^5\), the linearity is not violated, which means this range of data might fit the linear regression model better.


3.1 Simulation

Simulation is a critical technology used to develop planning and explore models to optimize decision making within the world of data science (de Paula Ferreira et al., 2020).

In this section, we performed a basic linear model simulation to determine how well it fits with the presetting conditions, such as adding normally distributed error to the regression line.

The basic procedure for simulation is to:

  1. Train a linear regression fit model for the observed data.
  2. Assuming the model is correct. Add a generated error to the linear regression model (we generated normally distributed error for this study).
  3. Establish the simulated data, and compare values with the observed data (by generating value distribution graphs, scatterplots of the relationships modeled and observed, and y=x plot).
  4. Analyze and interpret the simulated \(R^2\) value.
  5. Iterate and generate additional simulated data sets.
  6. Check, interpret, and plot the simulated \(R^2\) values for the simulated data sets.

Simulation for a single data set:

Code
noise <- function(x, mean = 0, sd){
  x + rnorm(length(x), 
            mean, 
            sd)
}
master_predict <- predict(energy_emissions_model)
master_sigma <- sigma(energy_emissions_model)

sim_response <- tibble(sim_emissions = noise(master_predict, 
                                          sd = master_sigma))

obs_emissions <- master |>
  ggplot(aes(x = `Average Emissions`)) +
  geom_histogram(binwidth = 3000, fill = 'lightsteelblue4') +
  labs(x = "Observed Emissions/1000 tonnes",
       y = "",
       subtitle = "Count",
       title = "Worldwide Yearly Emission Counts") +
  theme_bw()+
  theme(plot.title = element_text(hjust = 0, face = 'bold', size = 8.5),
        plot.subtitle = element_text(hjust = 0, size = 7))

sim_emissions_graph <- sim_response |>
  ggplot(aes(x = sim_emissions)) +
  geom_histogram(binwidth = 3500, fill = 'lightsteelblue4') +
  labs(x = "Simulated Emissions/1000 tonnes",
       y = "",
       subtitle = "Count",
       title = "Simulated Worldwide Yearly Emission Counts") +
  theme_bw()+
  theme(plot.title = element_text(hjust = 0, face = 'bold', size = 8.5),
        plot.subtitle = element_text(hjust = 0, size = 7))

obs_emissions + sim_emissions_graph
<<<<<<< HEAD

=======

>>>>>>> 0da3565d7a4ae51752152d44853cb9d35df3512c

We can identify the differences between observed and simulated data from the distribution visualization. When we initially view the two graphs, they do not seem to share many similarities. However, when we take a closer look, the concentrated value ranges, in other words, where most of the data resides, are similar.

The left graph displays the distribution of observed average emissions over the years from 29 different countries. We can see that most of the values fall within the intervals of \(1.1\times10^5\) tons to \(1.2\times10^5\) tons and \(1.6\times10^5\) tons to \(1.7\times10^5\) tons. The simulated yearly average emissions are also concentrated in a range of \(1.1\times10^5\) tons to \(1.2\times10^5\) tons and \(1.5\times10^5\) tons to \(1.7\times10^5\) tons.

Code
sim_data <- master |> 
  filter(!is.na(`Average Emissions`), 
         !is.na(`Average Percent Renewable`)
         ) |> 
  select(`Average Emissions`, `Average Percent Renewable`) |> 
  bind_cols(sim_response)


raw_graph <- master |> 
ggplot(aes(x = `Average Percent Renewable`, 
           y = `Average Emissions`)
       ) +
  geom_point(color = 'lightsalmon3') + 

  theme(legend.position = "none") +
  labs(x = " Avg Renewable energy consumption %", 
       y = "", 
       title = "Observed Relationships between Avg Renewable\nEnergy Consumption and Avg CO2 Emissions", 
       subtitle = "Avg CO2 Emissions/1000 tonnes") +
  theme(plot.title.position = "plot",
        plot.title = element_text(face = 'bold', size = 8.5),
        plot.subtitle = element_text(hjust = 0, size = 7))

sim_master_graph <- sim_data |> 
ggplot(aes(x = `Average Percent Renewable`, 
           y = sim_emissions)
       ) +
  geom_point(color = 'lightsalmon3') + 
  theme(legend.position = "none") +
  labs(x = " Avg Renewable energy consumption %", 
       y = "", 
       title = "Simulated Relationships between Avg Renewable\nEnergy Consumption and Avg CO2 Emissions", 
       subtitle = "Simulated Avg CO2 Emissions/1000 tonnes") +
    theme(plot.title.position = "plot",
        plot.title = element_text(face = 'bold', size = 8.5),
        plot.subtitle = element_text(hjust = 0, size = 7))


raw_graph + sim_master_graph
<<<<<<< HEAD

=======

>>>>>>> 0da3565d7a4ae51752152d44853cb9d35df3512c

In these two graphs, we plotted the relationships between the simulated and observed, average renewable energy consumption and average CO2 emissions among different countries. We can recognize that the simulated data in the right graph is more closely related to a straight line, which is the modeled linear regression line.

Code
# Check the similarity between the simulated data and observed data
sim_data |> 
  ggplot(aes(x = sim_emissions, 
             y = `Average Emissions`)
         ) + 
  geom_point(color = 'darkolivegreen') + 
   labs(x = "Simulated Avg CO2 Emissions/1000 tonnes", 
        y = "",
        title = "The Similarity between Observed Data and Simulated Data",
        subtitle = "Avg CO2 Emissions/1000 tonnes" ) + 
  geom_abline(slope = 1,
              intercept = 0, 
              color = "steelblue",
              linetype = "dashed",
              lwd = 1.5) +
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'))
<<<<<<< HEAD

=======

>>>>>>> 0da3565d7a4ae51752152d44853cb9d35df3512c

In this graph, we can examine the similarity between the observed data set and the simulated data set. Ideally, the data should fall on the blue dashed line, which represents the y=x line. The points above the blue line represent an overestimated simulation, while the points below the line indicate an underestimated simulation. Following an analysis of this study, we can determine that the observed data is typically underestimated. Generally speaking, the simulated values are distributed close to the y=x line, meaning that there is a strong relationship between the observed values and simulated values.

Code
# get the R squared for the simulated data fit for observed data.
p_value <- lm(`Average Emissions` ~ sim_emissions, 
   data = sim_data
   ) |> 
  glance() |>
  select(p.value) |>
  pull()

sim_r2 <- lm(`Average Emissions` ~ sim_emissions, 
             data = sim_data
             ) |> 
  glance() |> 
  select(r.squared) |> 
  pull()

simulated_stats <- tibble(p_value, sim_r2) |>
  rename(`P - Value` = p_value,
         `Simulated R Squared` = sim_r2)


simulated_stats |>
  kable(caption = 'Simulated R Squared and P-Value', digits = 10) |>
  kable_styling(bootstrap_options = c('striped', 'bordered'))
<<<<<<< HEAD ======= >>>>>>> 0da3565d7a4ae51752152d44853cb9d35df3512c
Simulated R Squared and P-Value
P - Value Simulated R Squared
0 0.8273886 0.8861348

The P - value is 0, and it is extremely small (p<0.05), which indicates the data is statistically significant.

<<<<<<< HEAD

\(R^2\) is 0.8273886, which means the simulated data can explain around 0.8273886 of the variation in observed data. In this case, we suppose that the one time simulation till now did a moderately good job.

=======

\(R^2\) is 0.8861348, which means the simulated data can explain around 0.8861348 of the variation in observed data. In this case, we suppose that the one time simulation till now did a moderately good job.

>>>>>>> 0da3565d7a4ae51752152d44853cb9d35df3512c

Simulation for a 1000 data sets:

Since the normally distributed error is randomly added to the linear regression model line, if we iterate the simulation many times, the result will be far more stable. With this method, we can get the entire group of interest, rather than a single sample of that group.

In the following steps, we created 1000 simulated data sets and combined the observed data set into a full new data set. Next, we also determined the \(R^2\) of the 1000 simulated data sets and plotted their distribution.

Code
# Created 1000 simulated dataset
nsims <- 1000
sims <- map_dfc(.x = 1:nsims,
                .f = ~ tibble(sim = noise(master_predict, 
                                          sd = master_sigma)
                              )
                )

# clean the colnames
colnames(sims) <- colnames(sims) |> 
  str_replace(pattern = "\\.\\.\\.",
                  replace = "_")

sims <- master |> 
  filter(!is.na(`Average Emissions`), 
         !is.na(`Average Percent Renewable`)) |> 
  select(`Average Emissions`) |> 
  bind_cols(sims)


# mapping to get 1000 simulated R squared.
sim_r_sq <- sims |> 
  map(~ lm(`Average Emissions` ~ .x, data = sims)) |> 
  map(glance) |> 
  map_dbl(~ .x$r.squared)

# to see the distribution of the 1000 simulated R square.
tibble(sims = sim_r_sq) |> 
  ggplot(aes(x = sims)) + 
  geom_histogram(binwidth = 0.025, fill = 'lightpink3') +
  labs(x = expression("Simulated"~ R^2),
       y = "",
       subtitle = "Number of Simulated Models",
       title = "1000 Simulated R-Squared Distribution")+
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'))
<<<<<<< HEAD

=======

>>>>>>> 0da3565d7a4ae51752152d44853cb9d35df3512c
Code
#The distribution of these values will tell if our assumed model does a good job of producing data similar to what was observed. If the model produces data similar to what was observed, we would expect values near 1.
# Our model is R square is around 0.8.

We can see from the distribution of the 1000 simulated \(R^2\) graph that the \(R^2\) have the highest frequency around 0.8 to 0.85, which is in accordance with the single simulated data \(R^2\).

4 Conclusion



While the Earth may seem prosperous to us today, there is only a finite amount of resources, and life, on the planet and the human population just hit 8 billion. Global temperatures continue to increase threatening life and habit for many living organisms. Protecting humanity’s environment should be society’s number one priority and thought of with every decision made. Energy use, as well CO2 emissions, is also on the rise while the amount of renewable energy the world is using is decreasing.

Through analyzing this data set it is clear there is strong, statistically significant effect of using renewable energy on the decreasing the amount of CO2 emitted into the atmosphere yet humanity is using less, moving in the wrong direction.

We urge every country to prioritize and help solve this global crisis putting the Earth at risk.

5 References

  1. de Paula Ferreira, W., Armellini, F., & De Santa-Eulalia, L. A. (2020). Simulation in industry 4.0: A state-of-the-art review. Computers & Industrial Engineering, 149, 106868. https://doi.org/10.1016/J.CIE.2020.106868

  2. Gapminder. (2019). Data. Gapminder.org; Gapminder. https://www.gapminder.org/data/

  3. Scatter and Line Plots in R. (n.d.). Plotly.com. Retrieved March 17, 2023, from https://plotly.com/r/line-and-scatter/#custom-color-scales

  4. Schork, J. (n.d.). How to Use ggplotly in R (2 Examples) | Static to Interactive Plot. Statistics Globe. Retrieved March 17, 2023, from https://statisticsglobe.com/ggplotly-function-r#:~:text=In%20this%20one%2C%20you%E2%80%99ll%20learn%20how%20to%20use

  5. Subtitles with ggplotly. (2019, May 16). https://datascott.com/blog/subtitles-with-ggplotly/